TELECONFERENCING SYSTEM

This invention relates to audio teleconferencing systems. These are systems in
which three or more participants, each having a telephone connection, can
participate in a multi-way discussion. The essential part of a teleconference
system is called the conference "bridge", and is where the audio signals from
all the participants are combined. Conference bridges presently function by
receiving audio from each of the participants, appropriately mixing the audio
signals, and then distributing the mixed signal to each of the participants. All
signal processing is concentrated in the bridge, and the result is monaural
(that is, there is a single sound channel). This arrangement is shown in Figure
1, which will be described in detail later. The principal drawback with this
system is that the audio is monophonic and generally of poor quality, and it is
very difficult to determine which participants are speaking at any one time,
especially when the number of participants is large.

According to the invention, there is provided a teleconferencing system
comprising a conference bridge having a multichannel connection to each customer
equipment, the customer equipment having means to separately process each
channel to provide an output, preferably spatialised, representing each of the
other participants. Preferably the conference bridge comprises a concentrator,
having means to identify the currently active input channels and to transmit
only those active channels over the multichannel connection, together with
control information identifying the transmitted channels. This reduces the
capacity required by the multichannel connection. The control information
identifying the active channels may be carried in a separate control channel, or
as an overhead on the active subset of channels. In a preferred arrangement the
channel representing a given participant is excluded from the output provided to
that participant. This may be achieved by excluding that channel from the
processing in the customer equipment, but is preferably achieved by excluding it
from the multichannel transmission from the bridge to that participant, thereby
reducing further the capacity required by the multichannel connection.

Exemplary embodiments of the invention will now be described, by way of 
example, with reference to the drawings, in which: 

Figure 1 illustrates a conventional teleconference system;

Figure 2 illustrates a spatial audio teleconference system according to one
embodiment of the invention;

Figure 3 illustrates an N-channel speech decoder used in the embodiment of
Figure 2;

Figure 4 illustrates an N-channel audio spatialiser used in the embodiment of
Figure 2;

Figure 5 illustrates a second embodiment of the invention;

Figure 6 illustrates how the invention may be used with conventional PSTN
channels;

Figure 7 illustrates a variant of the invention for use with a video conference
system;

Figure 8 illustrates a voice switched concentrator which may be used in the
embodiments of the invention; and

Figures 9, 10, and 11 illustrate various echo cancellation techniques.

In the conventional system illustrated in Figure 1 the conference bridge located
in the exchange equipment 100 receives signals from the various customer
equipments 10, (20, 30 not shown) in response to sounds detected by respective
microphones 11, 21, 31 etc. These signals are transmitted over the telephone
network (1) to the exchange 100 at which the bridge is established. Generally
the signals will travel by way of a local exchange (not shown) in which the
analogue signals are converted to digital form, usually employing logarithmic
companding such as "A-law" (as used for example in Europe) or "mu-law" (as used
for example in the United States of America), for onward transmission to the
bridge exchange 100. On arrival at the bridge exchange 100, the bridge passes
each incoming signal 11, 21, 31 through a respective digital converter 111, 112,
113 to convert it from A-law to a linear digital signal, and then passes the
linear signals to a digital combiner 120 to generate a combined signal. This
combined signal is re-converted to A-law in a further digital converter 110, and
the resulting signal is transmitted over the telephone network (2) to each
customer equipment 10, (20, 30) for conversion to sound in respective
loudspeakers 12, 22, 32 etc. In this way the exchange equipment 100 acts as a
"bridge" to allow one or more further customer equipments 30 to connect into a
simple two-way connection between customer equipments 10, 20.
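
The convert, combine and re-convert path of the conventional bridge can be
illustrated roughly as follows. This is a minimal sketch under stated
assumptions, not the exchange's actual implementation: it uses the continuous
A-law curve (A = 87.6) on samples normalised to the range -1 to +1 rather than
the 8-bit segmented encoding used in practice, and the function names
alaw_compress, alaw_expand and bridge_mix are hypothetical.

```python
import math

A = 87.6  # standard A-law compression parameter


def alaw_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] to its companded value in [-1, 1]."""
    ax = min(abs(x), 1.0)
    if ax < 1.0 / A:
        y = A * ax / (1.0 + math.log(A))
    else:
        y = (1.0 + math.log(A * ax)) / (1.0 + math.log(A))
    return math.copysign(y, x)


def alaw_expand(y: float) -> float:
    """Inverse of alaw_compress: companded value back to a linear sample."""
    ay = min(abs(y), 1.0)
    k = 1.0 + math.log(A)
    if ay < 1.0 / k:
        x = ay * k / A
    else:
        x = math.exp(ay * k - 1.0) / A
    return math.copysign(x, y)


def bridge_mix(companded_inputs):
    """Expand each participant's frame, sum sample by sample, clip, recompress.

    companded_inputs: one equal-length list of companded samples per participant.
    """
    linear = [[alaw_expand(s) for s in frame] for frame in companded_inputs]
    mixed = [max(-1.0, min(1.0, sum(samples))) for samples in zip(*linear)]
    return [alaw_compress(s) for s in mixed]
```

Every participant receives the same single mixed frame, which is why the
listener cannot separate the talkers afterwards.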

The systems illustrated in Figures 2 to 8 replace the conventional conference
bridge system of Figure 1 with a multicast system in which several channels can
be transmitted to each participant, using a multi-channel link comprising an
uplink 3, and a downlink comprising a control channel 4 and a digital audio
downlink 5 comprising several channels 51, 52. Participants with suitable
equipment can then process these channels 51, 52 in various ways as will be
described.

The transmission medium used for the uplink 3 and downlink 4, 5 can be any
suitable medium. ISDN (Integrated Services Digital Network) technology or LAN
(Local Area Network), respectively public and private data networks, are the
favoured transmission options since they provide adequate data rate and low
latency (that is, short delays due to coding and transmission buffering).
However, they are expensive and so far have a low penetration in the market
place. Internet Protocol techniques are more widely used, but have poor latency
and unreliable data rates. Being packet systems, they are less suited to voice
applications. It is also possible to use the conventional PSTN (public switched
telephone network) with a speech band modem. The latest internet type modems
provide up to 56 kbit/s downstream (links 4, 5: digital network down to the
customer via local loop), and up to 28.8 kbit/s upstream (link 3). They are low
cost and are commonly bundled into PC packages. Ideally a system should be able
to work with all of the above, and with standard analogue PSTN available as a
backup.

The signal mixing can take place either in the user's terminal equipment, or
in a centralised processing platform as shown in Figure 2. In Figure 2 the
customer equipment 10 contains a microphone 11 and loudspeaker system 12 as
before. However, the loudspeaker system 12 is a spatialised system, that is, it
has two or more channels to allow sounds to appear to emanate from different
directions. This may take the form of stereophonic headphones, or a more complex
system such as those disclosed in United States Patents 5533129 (Gefvert) and
5307415 (Fosgate), the article "Spatial Sound for Telepresence" by M. Hollier,
D. Burraston, and A. Rimell in the British Telecom Technology Journal, October
1997, or the applicant's own pending European Patent Application 97304218.7
filed on 17th June 1997.

The output from the microphone 11 is encoded by an encoder 13 forming part of
the customer equipment 10, and transmitted over the uplink 3 to the exchange
equipment 100. Here it is fed, together with the other input channels 21, 31
from the other participants, into a concentrator 230 which combines the various
inputs into an audio signal having a smaller number of channels 51, 52. These
channels are transmitted over multiple-channel digital audio links 5 to the
customer equipments 10, (20, 30) where they are first decoded by respective
decoders 14, 24, 34 and provided to a spatialiser 15 for controlling the mixing
of the channels to generate a spatialised signal in the loudspeaker equipment
12.

The concentrator 230 selects from the input channels 11, 21, 31 those carrying
useful information, typically those carrying speech, and passes only these over
the return link 5. This reduces the amount of information to be carried. A
control channel 4 carries data identifying which channels were selected. The
spatialiser 15 uses data from the control channel to identify which of the
original sound channels 11, 21, 31 it is receiving, and on which of the "N"
channels 51, 52 in the audio link each original channel is present, and
constructs a spatialised signal using that information. The spatialised signal
can be tailored to the individual customer, for example as to the number of
talkers in the spatialised system, the customer's preferences as to where in the
spatialised system each participant is to appear to be located, and which
channels to include (for example the original talker or a simultaneous
translation). In particular, the user may exclude the channel representing his
own input 11.

Transmission efficiency is achieved because only the active subset N of the
total number of channels M is transmitted at any one time. The subset is chosen
using a voice controlled dynamic channel allocation algorithm in the N:M
concentrator 230. A possible implementation of this is shown in Figure 8. Each
input channel 11, 21, 31 is monitored by a respective analyser 231, 232, 233. As
shown for analyser 231, the signal is subjected to a speech detection and
analysis process 231b. This detects whether speech is present on the respective
input 11, and gives a confidence value, indicative of how likely it is that the
signal contains speech. This ensures that low-level background speech is given a
lower weight than speech clearly addressed to the microphones 11, 21, 31 etc. A
value is also given for level, to ensure speech directed to the microphone is
preferred over background noise, and the level information can be passed to the
spatialisation system to select a coding algorithm appropriate to the
information in the speech. In order to detect and process the speech in the
signals they first need to be decoded in a decoder 231a (this may be dispensed
with if the speech detection system 231b can operate with digitally encoded
signals).

A voting algorithm 234 then selects which of the inputs 11, 21, 31 have the
clearest speech signals and controls a switch to direct each of those input
channels 11, 21, 31 which have been selected to a respective one of the output
channels 51, 52. Similar algorithms are used in Digital Circuit Multiplication
Equipment (DCME) systems in international telephony circuits. Data relating the
audio channels' content to the conference participants, and therefore the
correspondence between the input channels 11, 21, 31 and output channels 51, 52,
is transmitted over the control channel 4. Alternatively, this data can be
embedded in the encoded audio data.
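
The selection step and the resulting control data might be sketched as below.
This is an illustrative sketch only: the description does not specify the
scoring rule, so the product of confidence and level, the min_confidence
threshold, and the names ChannelReport and vote are all assumptions. The
optional exclude_input argument illustrates leaving a participant's own channel
out of the downlink sent to that participant.

```python
from dataclasses import dataclass


@dataclass
class ChannelReport:
    input_id: int      # e.g. 11, 21, 31 in the description
    confidence: float  # 0..1, how likely the signal is to contain speech
    level: float       # 0..1, normalised signal level


def vote(reports, n_outputs, min_confidence=0.5, exclude_input=None):
    """Return control data mapping output slot index -> selected input id."""
    candidates = [
        r for r in reports
        if r.confidence >= min_confidence and r.input_id != exclude_input
    ]
    # Assumed scoring rule: confident, loud speech ranks highest.
    ranked = sorted(candidates, key=lambda r: r.confidence * r.level, reverse=True)
    return {slot: r.input_id for slot, r in enumerate(ranked[:n_outputs])}
```

The returned mapping is the kind of information the control channel 4 would
need to carry so that the receiving equipment knows which participant occupies
each of the N downlink channels.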

When there are fewer talkers identified than there are available output
channels 51, 52, signal quality can be improved by using a less compressed
digitisation scheme for those input channels selected, thereby using more than
one output channel 51, 52 for each input channel selected. Telephone quality
speech may be achieved at 8 kbit/s, allowing eight talkers to be accommodated if
the system has a 64 kbit/s capability. Should fewer talkers be detected, the
64 kbit/s capability may be used instead to provide four 16 kbit/s audio
channels, capable of carrying 'good' quality speech, or a mixture of channels at
different bit rates, to allow the coding rates to be selected according to the
initial signal quality, or so that the main talker could be passed at higher
quality than the other talkers. Layered coding schemes can be used to allow
graceful switching between data rates.
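
The arithmetic above (eight talkers at 8 kbit/s or four at 16 kbit/s within
64 kbit/s) can be expressed as a small allocation rule. The sketch below is an
assumption about how such a rule could look, giving every active talker the same
rate; the function name allocate_rate and the equal-rate policy are
hypothetical, and the mixed-rate case mentioned in the text is not covered.

```python
def allocate_rate(active_talkers, capacity_kbps=64, ladder_kbps=(8, 16)):
    """Pick the highest per-talker rate on the ladder that fits the capacity.

    Returns (rate_kbps, talkers_carried). If even the lowest rate cannot carry
    every active talker, only as many as fit are carried at the lowest rate.
    """
    if active_talkers <= 0:
        return (0, 0)
    for rate in sorted(ladder_kbps, reverse=True):
        if rate * active_talkers <= capacity_kbps:
            return (rate, active_talkers)
    lowest = min(ladder_kbps)
    return (lowest, capacity_kbps // lowest)
```

For example, allocate_rate(4) selects 16 kbit/s for all four talkers, while
allocate_rate(8) falls back to 8 kbit/s.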

The N-channel de-multiplexer and speech decoder 14 is shown in Figure 3. This
receives the channels 51, 52, 53 etc carried in the audio downlink 5 and
separates them in a demultiplexer 140. Each channel 51, 52, etc is then
separately decoded in a respective decoder 141, 142, 143, etc for processing by
the spatialiser 15. The decoders 141, 142, etc may operate according to
different processes according to the individual coding algorithms used, under
the control of the control signals carried in the control channel 4.

A composite signal consisting of the summation of all input signals could also
be transmitted. Such a signal could be used by users having monaural receiving
equipment, and may also be used by the spatialiser to generate an ambient
background signal. Alternatively the spatialiser 15 may replace any of the
channels 11, 21, 31 not selected by the concentrator 230, and therefore not
represented in the N channel link 5, by "comfort noise"; that is, low-level
white noise, to avoid the aural impression of a void which would be occasioned
by a complete absence of signal.

The customer equipment 10 can be implemented using a desktop PC. Readily
available PC sound card technology can provide the audio interface and
processing needed for the simpler spatialising schemes. The more advanced sound
cards with built in DSP technology could be used for more advanced schemes. The
spatialiser 15 can use any one of a number of established techniques for
creating an artificial audio environment as detailed in the Hollier et al
article referred to above. Spatialisation techniques may be summarised as
follows. The simplest technique is "panning", where each signal is replayed with
appropriate weighting via two or more loudspeakers such that it is perceived as
emanating from the required direction. This is easy to implement, robust and may
also be used with headphones.

"Ambisonic" systems are more complicated and employ a technique known as
wavefront reconstruction to provide a realistic spatial audio percept. They can
create very good spatial sound, but only for a very small listening area and are
thus only appropriate for single listeners. For headphone listening, "binaural"
techniques can be used to provide very good spatialisation. These use
head-related transfer function (HRTF) filter pairs to recreate the soundfield
that would have been present at the entrance to the ear canal for a sound
emanating from any position in 3D space. This can give very good spatialisation
and may be extended for use with loudspeakers, when it is known as "transaural".
As with ambisonic systems, the correct listening area is very small. Any of
these spatialisation techniques may be used with the present invention.
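
The "appropriate weighting" used in panning, the simplest of the techniques
summarised above, can be illustrated for two loudspeakers as follows. The
description does not prescribe a particular weighting law; the constant-power
(sine/cosine) law used here is a common choice and is an assumption, as is the
function name pan_gains.

```python
import math


def pan_gains(position: float):
    """Constant-power stereo gains for a position in [-1 (left), +1 (right)].

    Returns (left_gain, right_gain); the squares of the gains always sum to 1,
    so the perceived loudness stays roughly constant as the source moves.
    """
    p = max(-1.0, min(1.0, position))    # clamp out-of-range positions
    angle = (p + 1.0) * math.pi / 4.0    # 0 (hard left) .. pi/2 (hard right)
    return math.cos(angle), math.sin(angle)
```

Each participant would be given a different position, so that each voice
appears to come from a distinct direction.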

The output of several spatialisers may be combined as shown in Figure 4, which
shows a spatialiser group for a stereophonic output having left and right
channels 12L, 12R. Each channel 51, 52, 53 is fed to a respective spatialiser
151, 152, 153 which, under the control of a coefficient selector 150 controlled
by the signals in the control channel 4, transmits an output 151L, 151R etc to
each of a series of combiners 15L, 15R. The processing used to create the
outputs 151L, 151R etc is operated under the control of the signal 4 such that
each channel appears as a virtual sound source, having its own location in the
space around the listener.
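
The structure of Figure 4 (per-channel spatialisers, a coefficient selector
driven by the control data, and one combiner per output) might be sketched as
follows. Simple gain pairs stand in for the spatialisers, which is a
simplification; the function name spatialise_group and the shape of the
positions table are hypothetical.

```python
def spatialise_group(frames, slot_to_input, positions):
    """Mix decoded downlink channels into left and right outputs.

    frames: one list of samples per downlink slot (channels 51, 52, ...).
    slot_to_input: control data, slot index -> participant id.
    positions: participant id -> (left_gain, right_gain), chosen by the listener.
    Returns (left, right) sample lists produced by the two combiners.
    """
    length = len(frames[0]) if frames else 0
    left, right = [0.0] * length, [0.0] * length
    for slot, frame in enumerate(frames):
        # Coefficient selection: the control data says who occupies this slot.
        gain_l, gain_r = positions[slot_to_input[slot]]
        for i, sample in enumerate(frame):
            left[i] += gain_l * sample    # combiner for the left output
            right[i] += gain_r * sample   # combiner for the right output
    return left, right
```

Because the gains are looked up by participant rather than by slot, a talker
keeps the same apparent position even when the concentrator moves that talker to
a different downlink channel.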

The positions of virtual sources in three dimensional space could be determined
automatically, or by manual control, with the user selecting the preferred
positioning for each virtual sound source. For a video conference the
positioning can be set to correspond with the appropriate video picture window.
The video images may be sent by other means, or may be static images retrieved
from a local storage by the individual user.

If the spatialised sound is relayed via loudspeakers 12, rather than
headphones, it will be necessary to prevent signals from the loudspeakers 12
being picked up by the microphone 11, re-transmitted and being heard as an echo
at the distant sites 20, 30 etc. A technique for achieving this will be
described later, with reference to Figure 11.

Figure 5 shows an alternative arrangement to that of Figure 4, in which the
spatialisation is computed in the conference 'bridge'. Each conference
participant gets the same spatialised signals, thus simplifying the customer
equipment. Figure 5 is similar in general arrangement to Figure 2, except that
the decoder 14 and spatialiser 15 are part of the exchange equipment 200. The
output from the spatialiser 15 is passed to an encoder 18 which transmits the
required number of audio channels (e.g. two for a stereo system) to each
customer 10, 20, 30. This requires the number of channels in the downlink 5 to
be equal to the number of audio channels in the spatialisation systems' outputs,
instead of the number selected by the concentrator (plus the control channel 4)
as in the embodiment of Figure 2. It also simplifies the customer equipment 10.
However, this arrangement requires all customer installations 10, 20, 30 to have
similar spatialisation systems, and in particular the same number of audio
channels. It would also be more difficult to remove a talker's own voice from
the signal he receives. Echo control would also be more complicated, and channel
coding may degrade the spatialisation.

Conventional analogue connections could be included in the conference by
providing each analogue connection 43, 45 to the 'bridge' 200 with an encoder
42, as shown in Figure 6, to provide an input 41 to the concentrator. The output
5 of the concentrator 230 is also decoded and combined in a unit 44 to provide a
monaural conference signal 45 to the analogue user 40.

In the embodiments of Figures 2 to 4 and Figure 5, if loudspeakers are used
there is a need to control acoustic feedback ("echo") between the loudspeaker 12
and the microphone 11, which will result in signals being retransmitted back
into the system. This will result in each user hearing one or more delayed
versions of each signal (including their own transmissions) arriving from the
other users. For a monophonic system echo control is usually done using an echo
canceller as shown in Figure 9. The echo signal, represented by D, is caused by
the acoustic path J between the loudspeaker 12 and microphone 11 of equipment 10
in room B. The cancellation is achieved in an echo control unit 16 by using an
adaptive filter to create a synthetic model of the signal path such that the
echo may be removed by subtraction. The signal E, returned to equipment 20 in
room A, is now free of echoes, containing only sounds that originated in room B.
The optimum modelling of the acoustic path J is usually achieved by the adaptive
filter in a manner such that some appropriate function of the signal E is driven
towards zero. Echo control using adaptive filters in this manner is well known.
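
A single-channel echo canceller of this general kind can be sketched as below.
The description does not name an adaptation rule; the normalised least mean
squares (NLMS) update used here is one widely used choice and is an assumption,
as are the function name nlms_echo_cancel and the default parameter values.

```python
def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the loudspeaker echo from the mic signal.

    far_end: samples sent to the loudspeaker; mic: samples from the microphone.
    Returns the residual (signal E), which the adaptation drives towards zero
    when the microphone carries only echo.
    """
    weights = [0.0] * taps   # synthetic model of the acoustic path J
    history = [0.0] * taps   # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        e = d - estimate     # echo removed by subtraction
        power = sum(h * h for h in history) + eps
        # NLMS update: step along the input, normalised by its recent power.
        weights = [w + mu * e * h / power for w, h in zip(weights, history)]
        residual.append(e)
    return residual
```

In use, local speech present in the microphone signal would remain in the
residual, since it is not correlated with the far-end signal; a practical
canceller would also need to limit adaptation while the local talker is
speaking, which this sketch does not attempt.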

Multi-channel echo cancellation, as shown in Figure 10 for two channels, is
more complex since there are two input channels 51, 52 and therefore two
loudspeakers 12L, 12R. It is therefore necessary to model two echo paths K and L
for each of the two return channels 3L, 3R. (The process is only shown for
return channel 3L, using microphone 11L.) Correct echo cancellation is only
achieved if adaptive filters 161L, 162L model the signal paths K and L
respectively. (Two further filters 161R, 162R are required for the other return
channel 3R.) However, it is not possible to find a correct model for each path
K, L independently without some difficult and expensive signal processing as
described in "A better understanding and an improved solution to the specific
problems of stereophonic echo cancellation" (IEEE Transactions on Speech and
Audio Processing, Vol. 6, No. 2, March 1998; authors J. Benesty, D. R. Morgan
and M. M. Sondhi).

The system described above with reference to Figure 4 employs linear artificial
spatialisation techniques. Figure 11 shows how this, and the fact that the echo
from each loudspeaker 12L, 12R combines linearly at each microphone 11L, (11R,
not shown), allows echo cancellation to be provided for each output channel 3L,
(3R) by having a separate adaptive filter 161L, 162L, 163L, (161R, 162R, 163R)
on each input channel 51, 52, 53. Thus the adaptive filter 161L will model the
combination of the spatialiser 151 for the channel 51, and the echo path between
the loudspeakers 12L and 12R and the microphone 11L.
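
The linearity argument can be written out as follows. The symbols are introduced
here for illustration only and do not appear in the drawings: x_i is the decoded
signal on input channel i, s_iL and s_iR are the impulse responses applied by
the spatialiser for channel i to the left and right outputs, k and l are the
impulse responses of the echo paths from loudspeakers 12L and 12R to microphone
11L, and * denotes convolution. All of these are assumed to be linear and
time-invariant.

```latex
\begin{align}
  y_L(n) &= \sum_i (s_{iL} * x_i)(n), \qquad
  y_R(n)  = \sum_i (s_{iR} * x_i)(n) \\
  d(n)   &= (k * y_L)(n) + (l * y_R)(n)
          = \sum_i \bigl( (k * s_{iL} + l * s_{iR}) * x_i \bigr)(n)
          = \sum_i (h_i * x_i)(n)
\end{align}
```

Under these assumptions the echo d at the microphone is a sum of terms, each
depending on a single input channel through one combined response h_i. Each
adaptive filter therefore needs to model only its own h_i, driven by its own
input x_i, rather than the individual loudspeaker paths.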

The invention could be applied to a conference situation in which there are
several participants at each location, such as the video conference shown in
Figure 7. Close microphones 11a, 11b, 11c, for example of the "tie-clip" type,
are used to pick up the sound from each individual talker, and a talker location
system 60 is used to keep track of their spatial position. The talker location
system 60 may comprise a system of microphones which can identify the positions
of sound sources. Relating the position of a sound source to that of the
tie-clip microphone 11 currently in use makes it possible to learn the position
of each talker by audio means alone. Alternatively, the system may detect the
position of each user by means such as optical recognition of a badge carried by
each user. In either case, the position data is passed to the far end (Room B),
where correct spatialisation is reconstructed, for output by loudspeakers 12L,
12M, 12R etc. This would achieve a true spatial conference and overcome the
associated echo control problems, since the "tie-clip" microphones 31a, 31b, 31c
have a limited range and will not detect the outputs from the loudspeakers 32 in
the same room.

CLAIMS

1. A teleconferencing system comprising a conference bridge having a
multichannel connection to each customer equipment, at least one customer
equipment having means to separately process each channel to provide a plurality
of outputs, each output representing one of the other participants.

2. A system according to claim 1, wherein the customer equipment has means to
combine the outputs representing each participant to provide a spatialised
output in which each participant is represented by a virtual sound source.

3. A system according to claim 1 or 2, wherein the conference bridge comprises
a concentrator, having means to identify the currently active input channels and
to transmit only those active channels over the multichannel connection,
together with control information identifying the transmitted channels.

4. A system according to any preceding claim, wherein the channel representing
a given participant is excluded from the output provided to that participant.

5. A system according to claim 4, comprising means in the customer 
equipment for excluding the said channel from the processing. 


6. A system according to claim 4, comprising means for excluding the said 
channel from the multichannel transmission from the bridge to the respective 
participant. 

7. A system according to any preceding claim, the customer equipment having
echo cancellation means comprising means for detecting correlations between the
output signal from the customer equipment and the signals carried on individual
input channels to the customer equipment representative of other users, such
correlations being indicative of acoustic feedback at the customer equipment,
and means for cancelling such feedback signals in the output signal.



8. A system according to claim 7, wherein the customer equipment comprises,
for each channel of the output signal, a plurality of adaptive filters, each
adaptive filter being arranged to model the echo path between a respective input
channel and the respective output channel, and for each output channel there
being provided a combiner for adding the outputs of the respective plurality of
adaptive filters to generate an echo cancellation signal for the respective
output channel.

9. A method of providing teleconferencing services to a plurality of customer
equipments, in which a multichannel connection is provided from a conference
bridge to each customer equipment, in which at least one customer equipment
processes each channel separately to provide a plurality of outputs, each
representing one of the other participants.

10. A method according to claim 9, wherein the conference bridge identifies
the currently active input channels and transmits only those active channels
over the multichannel connection, together with control information identifying
the transmitted channels.



ABSTRACT
TELECONFERENCING SYSTEM

A teleconferencing system comprises a conference bridge 100 having a
multichannel connection 5 to each customer equipment 10. The customer 
equipment 10 has means 14 to separately process each channel 51, 52, 53 to 
provide one output representing each of the other participants. These outputs can 
be combined in a spatialiser 15 to provide a spatialised output 12 in which each 
participant is represented by a virtual sound source. The conference bridge 100 
comprises a concentrator 230, having means to identify the currently active input 
channels and to transmit only those active channels over the multichannel 
connection, together with control information identifying the transmitted channels. 




Figure (2) 
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1 

TELECONF ERENCIIMG SY.qTPM J 

This invention relates to audio teleconferencing systems. These are systems in 
which three or more participants, each having a telephone connection, can 
5 participate in a multi-way discussion. The essential part of a teleconference system 
is called the conference "bridge", and is where the audio signals from all the 
participants are combined. Conference bridges presently function by receiving 
audio from each of the participants, appropriately mixing the audio signals, and 
then distributing the mixed signal to each of the participants. All signal processing 
10 is concentrated in the bridge, and the result is monaural (that is, there is a single 
sound channel). This arrangement is shown in Figure 1, which will be described in 
• detail later. The principal drawback with this system is that the audio quality is 

monophonic, generally poor, it is very difficult to determine which participants are 
speaking at any one time, especially when the number of participants is large. 
15 According to the invention, there is provided a teleconferencing system 

comprising a conference bridge having a multichannel connection to each customer 
equipment, the customer equipment having means to separately process each 
channel to provide an output, preferably spatialised, representing each of the other 
participants. Preferably the conference bridge comprises a concentrator, having 
20 means to identify the currently active input channels and to transmit only those 
active channels over the multichannel connection, together with control 
information identifying the transmitted channels. This reduces the capacity 
jjP required by the multichannel connection. The control information identifying the 

active channels may be carried in a separate control channel, or as an overhead on 
25 the active subset of channels. In a preferred arrangement the channel representing 
a given participant is excluded from the output provided to that participant. This 
may be achieved by excluding that channel from the processing in the customer 
equipment, but is preferably achieved by excluding it from the multichannel 
transmission from the bridge to that participant, thereby reducing further the 
30 capacity required by the multichannel connection. 

Exemplary embodiments of the invention will now be described, by way of 
example, with reference to the drawings, in which: 

Figure 1 illustrates a conventional teleconference system; 
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Figure 2 illustrates a spatial audio teleconference system according to one 
embodiment of the invention; 

Figure 3 illustrates a N-channel speech decoder used in the embodiment of 
Figure 2; 

5 Figure 4 illustrates a N-Channel audio spatialiser used in the embodiment 

of Figure 2; 

Figure 5 illustrates a second embodiment of the invention; 
Figure 6 illustrates how the invention may be used with conventional 
PSTN channels; 

10 Figure 7 illustrates a variant of the invention for use with a video 

conference system; 

Figure 8 illustrates a voice switched concentrator which may be used in 
the embodiments of the invention. 

Figures 9, 10, and 1 1 illustrate various echo cancellation techniques. 

15 In the conventional system illustrated in Figure 1 the conference bridge 

located in the exchange equipment 100 receives signals from the various customer 
equipments 10, (20, 30 not shown) in response to sounds detected by respective 
microphones 11, 21, 31 etc. These signals are transmitted over the telephone 
network (1), to the exchange 100 at which the bridge is established. Generally the 

20 signals will travel by way of a local exchange (not shown) in which the analogue 
signals are converted to digital form, usually employing linear companding such as 
"A law" (as used for example in Europe) or "mu-Law" (as used for example in the 
United States of America) for onward transmission to the bridge exchange 100. On 
arrival at the bridge exchange 100, the bridge passes each incoming signal 11, 21, 

25 31 through a respective digital converter 111, 112, 1 1 3 to convert them from A 
Law to linear digital signals, and then passes the linear signals to a digital combiner 
120 to generate a.combined signal. This combined signal is re-converted to A law 
in a further digital converter 110, and the resulting signal transmitted over the 
telephone network (2) to each customer equipment 10, (20, 30) for conversion to 

30 sound in respective loudspeakers 12, 22, 32 etc. In this way the exchange 
equipment 100 acts as a "bridge" to allow one or more customer equipments 32 
to connect into a simple two-way connection between customer equipments 10, 
20. 
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The systems illustrated in Figures 2 to 8 replace the conventional 
conference bridge systenn of Figure 1 with a multicast system in which several 
channels can be transmitted to each participant, using a multi-channel link 
comprising an uplink 3, and a downlink comprising a control channel 4 and a digital 
5 audio downlink 5 comprising several channels 51, 52. Participants with suitable 
equipment can then process these channels 51, 52 in various ways as will be 
described. 

The transmission medium used for the uplink 3 and downlink 4, 5 can be 
any suitable medium. ISDN (Integrated Services Digital Network) technology or LAN 
(Local Area Network) technology, respectively public and private data networks, are the 
favoured transmission options since they provide an adequate data rate and low 
latency (that is, short delays due to coding and transmission buffering). However, they are 
expensive and so far have a low penetration in the market place. Internet Protocol 
techniques are more widely used, but have poor latency and unreliable data rates. 
Being packet systems, they are less suited to voice applications. It is also possible 
to use the conventional PSTN (public switched telephone network) with a speech 
band modem. The latest internet type modems provide up to 56kbit/s downstream 
(links 4, 5: digital network down to the customer via the local loop), and up to 
28.8kbit/s upstream (link 3). They are low cost and are commonly bundled into PC 
packages. Ideally a system should be able to work with all of the above, and with 
standard analogue PSTN available as a backup. 

The signal mixing can take place either in the user's terminal equipment, or 
in a centralised processing platform as shown in Figure 2. In Figure 2 the customer 
equipment 10 contains a microphone 11 and loudspeaker system 12 as before. 
However, the loudspeaker system 12 is a spatialised system; that is, it has two or 
more channels to allow sounds to appear to emanate from different directions. This 
may take the form of stereophonic headphones, or a more complex system such as 
those disclosed in United States Patents 5533129 (Gefvert) and 5307415 (Fosgate), the article 
"Spatial Sound for Telepresence" by M. Hollier, D. Burraston, and A. Rimell in the 
British Telecom Technology Journal, October 1997, or the applicant's own pending 
European Patent Application 97304218.7 filed on 17th June 1997. 

The output from the microphone 11 is encoded by an encoder 13 forming 
part of the customer equipment 10, and transmitted over the uplink 3 to the 
exchange equipment 100. Here it is fed, together with the other input channels 21, 
31 from the other participants, into a concentrator 230 which combines the 
various inputs into an audio signal having a smaller number of channels 51, 52. 
These channels are transmitted over multiple-channel digital audio links 5 to the 
customer equipments 10, 20, 30, where they are first decoded by respective 
decoders 14, 24, 34 and provided to a spatialiser 15 for controlling the mixing of 
the channels to generate a spatialised signal in the speaker equipment 12. 

The concentrator 230 selects from the input channels 11, 21, 31 those 
carrying useful information, typically those carrying speech, and passes only these 
over the return link 5. This reduces the amount of information to be carried. A 
control channel 4 carries data identifying which channels were selected. The 
spatialiser 15 uses data from the control channel to identify which of the original 
sound channels 11, 21, 31 it is receiving, and on which of the "N" channels 51, 
52 in the audio link each original channel is present, and constructs a spatialised 
signal using that information. The spatialised signal can be tailored to the individual 
customer, for example in the number of talkers in the spatialised system, the 
customer's preferences as to where in the spatialised system each participant is to 
appear to be located, and which channels to include; for example the original 
talker or a simultaneous translation. In particular, the user may exclude the channel 
representing his own input 11. 
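
By way of illustration only, the use of the control data by the spatialiser might be sketched as follows. The sketch assumes that the control channel 4 delivers a simple mapping from each downlink channel to the original input channel it currently carries; that message layout, the function name assign_sources and the table of preferred positions are hypothetical and are not specified in this description.

```python
def assign_sources(control_map, own_input, positions):
    """Return {downlink_channel: azimuth_degrees} for the channels to render.

    control_map: {downlink channel: original input channel}, e.g. {51: 21, 52: 31}
                 (assumed form of the data carried on the control channel)
    own_input:   the listener's own input channel, which is left out
    positions:   listener preferences, {original input channel: azimuth}
    """
    rendered = {}
    for downlink, source in sorted(control_map.items()):
        if source == own_input:
            continue  # the user may exclude the channel representing his own input
        # Assumption: a talker with no stated preference is placed straight ahead.
        rendered[downlink] = positions.get(source, 0.0)
    return rendered
```

For example, a listener on input 11 who receives {51: 11, 52: 31} would render only channel 52, at whatever position has been chosen for participant 31.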

Transmission efficiency is achieved because only the active subset N of the 
total number of channels M is transmitted at any one time. The subset is 
chosen using a voice controlled dynamic channel allocation algorithm in the N:M 
concentrator 230. A possible implementation of this is shown in Figure 8. Each 
input channel 11, 21, 31 is monitored by a respective analyser 231, 232, 233. As 
shown for analyser 231, the signal is subjected to a speech detection and analysis 
process 231b. This detects whether speech is present on the respective input 11, 
and gives a confidence value, indicative of how likely it is that the signal contains speech. 
This ensures that low-level background speech is given a lower weight than 
speech clearly addressed to the microphones 11, 21, 31 etc. A value is also given 
for level, to ensure speech directed to the microphone is preferred over background 
noise, and the level information can be passed to the spatialisation system to select 
a coding algorithm appropriate to the information in the speech. In order to detect 
and process the speech in the signals they first need to be decoded in a decoder 
231a (this may be dispensed with if the speech detection system 231b can 
operate with digitally encoded signals). 

A voting algorithm 234 then selects which of the inputs 11, 21, 31 have 
the clearest speech signals and controls a switch to direct each of those input 
channels 11, 21, 31 which have been selected to a respective one of the output 
channels 51, 52. Similar algorithms are used in Digital Circuit Multiplication 
Equipment (DCME) systems in international telephony circuits. Data relating the 
audio channels' content to the conference participants, and therefore the 
correspondence between the input channels 11, 21, 31 and output channels 51, 
52, is transmitted over the control channel 4. Alternatively, this data can be 
embedded in the encoded audio data. 
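
A minimal sketch of one possible voting step is given below, purely as an illustration. It assumes each analyser reports a (confidence, level) pair per input channel, ranks the inputs by confidence and then by level, and assigns the best of them to the available output channels. The threshold value, the ranking rule and the function name vote are assumptions; the description does not prescribe a particular algorithm.

```python
def vote(analyses, output_channels, min_confidence=0.5):
    """Choose which inputs are switched onto the output channels.

    analyses:        {input channel: (confidence, level)} from the analysers
    output_channels: list of available output channels, e.g. [51, 52]
    Returns {output channel: input channel}, the mapping that would be
    reported over the control channel.
    """
    # Keep only inputs judged likely to contain speech (assumed threshold).
    candidates = [
        (confidence, level, channel)
        for channel, (confidence, level) in analyses.items()
        if confidence >= min_confidence
    ]
    # Highest confidence first, then highest level; channel number breaks ties.
    candidates.sort(key=lambda c: (-c[0], -c[1], c[2]))
    chosen = [channel for _, _, channel in candidates[:len(output_channels)]]
    return dict(zip(output_channels, chosen))
```

A practical concentrator would probably add some hold-over time so that a talker is not dropped during short pauses; that refinement is omitted here.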

When there are fewer talkers identified than there are available output 
channels 51, 52, signal quality can be improved by using a less compressed 
digitisation scheme for those input channels selected, thereby using more than one 
output channel 51, 52 for each input channel selected. Telephone quality speech 
may be achieved at 8kbit/s, allowing eight talkers to be accommodated if the 
system has a 64kbit/s capability. Should fewer talkers be detected, the 64kbit/s 
capability may be used instead to provide four 16kbit/s audio channels, capable of 
carrying 'good' quality speech, or a mixture of channels at different bit rates, to 
allow the coding rates to be selected according to the initial signal quality, or so 
that the main talker could be passed at higher quality than the other talkers. 
Layered coding schemes can be used to allow graceful switching between data 
rates. 
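
The arithmetic of the preceding paragraph can be sketched as follows, on the simplifying assumption that every active talker is coded at the same rate. The function name per_talker_rate and the default list of rates (the 8 and 16 kbit/s figures mentioned above) are illustrative only.

```python
def per_talker_rate(link_kbps, talkers, rates=(8, 16)):
    """Highest coding rate (kbit/s) at which all talkers fit in the link.

    Returns None when there are no talkers or even the lowest rate does
    not fit. Assumes, for simplicity, one common rate for every talker.
    """
    if talkers <= 0:
        return None
    fitting = [rate for rate in rates if rate * talkers <= link_kbps]
    return max(fitting) if fitting else None
```

With a 64 kbit/s link, eight talkers are carried at 8 kbit/s each, while four or fewer can each be given 16 kbit/s. A mixture of rates, as also contemplated above, would need a more elaborate allocation.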

The N-channel de-multiplexer and speech decoder 14 is shown in Figure 3. 
This receives the channels 51, 52, 53 etc carried in the audio downlink 5 and 
separates them in a demultiplexer 140. Each channel 51, 52, etc is then separately 
decoded in a respective decoder 141, 142, 143, etc for processing by the 
spatialiser 15. The decoders 141, 142, etc may operate according to different 
processes according to the individual coding algorithms used, under the control of 
the control signals carried in the control channel 4. 

A composite signal consisting of the summation of all input signals could 
also be transmitted. Such a signal could be used by users having monaural 
receiving equipment, and may also be used by the spatialiser to generate an 
ambient background signal. Alternatively the spatialiser 15 may replace any of the 
channels 11, 21, 31 not selected by the concentrator 230, and therefore not 
represented in the N channel link 5, by "comfort noise"; that is, low-level white 
noise, to avoid the aural impression of a void which would be occasioned by a 
complete absence of signal. 
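
As a simple illustration, comfort noise for an absent channel might be generated as below. The sketch assumes samples normalised to the range -1 to +1 and uses uniformly distributed white noise; the amplitude value and the function name comfort_noise are assumptions chosen for the example.

```python
import random


def comfort_noise(n_samples, amplitude=0.001, seed=None):
    """Return n_samples of low-level white noise in [-amplitude, +amplitude].

    Assumption: audio samples are floats normalised to the range -1..+1,
    so the default amplitude is about 60 dB below full scale.
    """
    rng = random.Random(seed)  # a seed gives repeatable output for testing
    return [rng.uniform(-amplitude, amplitude) for _ in range(n_samples)]
```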

The customer equipment 10 can be implemented using a desktop PC. 
Readily available PC sound card technology can provide the audio interface and 
processing needed for the simpler spatialising schemes. The more advanced sound 
cards with built in DSP technology could be used for more advanced schemes. The 
spatialiser 15 can use any one of a number of established techniques for creating 
an artificial audio environment as detailed in the Hollier et al article referred to 
above. Spatialisation techniques may be summarised as follows. The simplest 
technique is "panning", where each signal is replayed with appropriate weighting 
via two or more loudspeakers such that it is perceived as emanating from the 
required direction. This is easy to implement, robust and may also be used with 
headphones. 
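
One common way of choosing the panning weights for two loudspeakers is the constant-power (sine/cosine) law, sketched below. This particular law is offered only as an example; the description does not specify which weighting is used, and the function names are hypothetical.

```python
import math


def pan_gains(position):
    """Left and right gains for a source at `position`.

    position: -1.0 is fully left, 0.0 is centre, +1.0 is fully right.
    Uses the constant-power law, so left**2 + right**2 is always 1.
    """
    position = max(-1.0, min(1.0, position))   # clamp to the valid range
    angle = (position + 1.0) * math.pi / 4.0   # maps -1..+1 onto 0..pi/2
    return math.cos(angle), math.sin(angle)


def pan(samples, position):
    """Apply the panning weights to a mono signal, giving (left, right)."""
    gain_left, gain_right = pan_gains(position)
    return [gain_left * s for s in samples], [gain_right * s for s in samples]
```

A centred source receives equal gains of about 0.707 in each loudspeaker, so that its perceived loudness stays roughly constant as it is moved from side to side.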

"Ambisonic" systems are more complicated and employ a technique 
known as wavefront reconstruction to provide a realistic spatial audio percept. 
They can create very good spatial sound, but only for a very small listening area 
and are thus only appropriate for single listeners. For headphone listening, 
"binaural" techniques can be used to provide very good spatialisation. These use 
head-related transfer function (HRTF) filter pairs to recreate the soundfield that 
would have been present at the entrance to the ear canal for a sound emanating 
from any position in 3D space. This can give very good spatialisation and may be 
extended for use with loudspeakers, when it is known as "transaural". As with 
ambisonic systems, the correct listening area is very small. Any of these 
spatialisation techniques may be used with the present invention. 

The output of several spatialisers may be combined as shown in Figure 4, 
which shows a spatialiser group for a stereophonic output having left and right 
channels 12L, 12R. Each channel 51, 52, 53 is fed to a respective spatialiser 151, 
152, 153 which, under the control of a coefficient selector 150 controlled by the 
signals in the control channel 4, transmits an output 151L, 151R etc to each of a 
series of combiners 15L, 15R. The processing used to create the outputs 151L, 
151R etc is operated under the control of the signal 4 such that each channel 
appears as a virtual sound source, having its own location in the space around the 
listener. 
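
The combining structure of Figure 4 might be sketched as follows, on the simplifying assumption that each spatialiser applies only a pair of gains (left, right) supplied by the coefficient selector. More elaborate spatialisers would apply filters rather than plain gains; the function name and data layout are illustrative only.

```python
def spatialise_group(channels, coefficients):
    """Mix several decoded channels into one stereo pair.

    channels:     {channel number: list of samples}, e.g. {51: [...], 52: [...]}
    coefficients: {channel number: (gain_left, gain_right)}, assumed to be
                  chosen by the coefficient selector from the control data
    Returns (left, right), the outputs of the two combiners.
    """
    length = max((len(s) for s in channels.values()), default=0)
    left = [0.0] * length
    right = [0.0] * length
    for number, samples in channels.items():
        gain_left, gain_right = coefficients[number]
        for i, sample in enumerate(samples):
            left[i] += gain_left * sample    # contribution to the left combiner
            right[i] += gain_right * sample  # contribution to the right combiner
    return left, right
```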

The positions of virtual sources in three dimensional space could be 
determined automatically, or by manual control, with the user selecting the 
preferred positioning for each virtual sound source. For a video conference the 
positioning can be set to correspond with the appropriate video picture window. 
The video images may be sent by other means, or may be static images retrieved 
from a local storage by the individual user. 

If the spatialised sound is relayed via loudspeakers 12, rather than 
headphones, it will be necessary to prevent signals from the loudspeakers 12 being 
picked up by the microphone 11, re-transmitted and heard as an echo at the 
distant sites 20, 30 etc. A technique for achieving this will be described later, 
with reference to Figure 11. 

Figure 5 shows an alternative arrangement to that of Figure 4, in which 
the spatialisation is computed in the conference 'bridge'. Each conference 
participant gets the same spatialised signals, thus simplifying the customer 
equipment. Figure 5 is similar in general arrangement to Figure 2, except that the 
decoder 14 and spatialiser 15 are part of the exchange equipment 200. The output 
from the spatialiser 15 is passed to an encoder 18 which transmits the required 
number of audio channels (e.g. two for a stereo system) to each customer 10, 20, 
30. This requires the number of channels in the downlink 5 to be equal to the 
number of audio channels in the spatialisation systems' outputs, instead of the 
number selected by the concentrator (plus the control channel 4) as in the 
embodiment of Figure 2. It also simplifies the customer equipment 10. However, 
this arrangement requires all customer installations 10, 20, 30 to have similar 
spatialisation systems, and in particular the same number of audio channels. It 
would also be more difficult to remove a talker's own voice from the signal he 
receives. Echo control would also be more complicated, and channel coding may 
degrade the spatialisation. 

Conventional analogue connections could be included in the conference by 
providing each analogue connection 43, 45 to the 'bridge' 200 with an encoder 
42, as shown in Figure 6, to provide an input 41 to the concentrator. The output 
5 of the concentrator 230 is also decoded and combined in a unit 44 to provide a 
monaural conference signal 45 to the analogue user 40. 

In the embodiments of Figures 2 to 4 and Figure 5, if loudspeakers are 
used there is a need to control acoustic feedback ("echo") between the 
loudspeaker 12 and the microphone 11, which will result in signals being 
retransmitted into the system. This will result in each user hearing one or 
more delayed versions of each signal (including their own transmissions) arriving 
from the other users. For a monophonic system echo control is usually done using 
an echo canceller as shown in Figure 9. The echo signal, represented by D, is 
caused by the acoustic path J between the loudspeaker 12 and microphone 11 of 
equipment 10 in room B. The cancellation is achieved in an echo control unit 16 
by using an adaptive filter to create a synthetic model of the signal path such that 
the echo may be removed by subtraction. The signal E, returned to equipment 20 
in room A, is now free of echoes, containing only sounds that originated in Room 
B. The optimum modelling of the acoustic path J is usually achieved by the 
adaptive filter in a manner such that some appropriate function of the signal E is 
driven towards zero. Echo control using adaptive filters in this manner is well 
known. 
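
For illustration, a monophonic echo canceller of this general kind can be sketched with the normalised least-mean-squares (NLMS) algorithm, one widely used adaptive filter. The choice of NLMS, the number of taps and the step size are assumptions made for the example; the description requires only that some adaptive filter models the path J.

```python
def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Remove the loudspeaker echo from the microphone signal.

    far_end: samples sent to the loudspeaker (the signal arriving from room A)
    mic:     microphone samples containing the echo D
    Returns the residual signal E, sample by sample.
    """
    weights = [0.0] * taps   # adaptive model of the acoustic path J
    history = [0.0] * taps   # most recent loudspeaker samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        e = d - estimate                       # echo removed by subtraction
        power = sum(h * h for h in history) + eps
        # NLMS update: adjust the model so as to drive e towards zero.
        weights = [w + mu * e * h / power for w, h in zip(weights, history)]
        residual.append(e)
    return residual
```

When the microphone picks up nothing but echo, the residual falls towards zero as the filter converges; any local speech would remain in E. A practical canceller would also suspend adaptation while the local talker is speaking, which is not shown.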

Multi-channel echo cancellation, as shown in Figure 10 for two channels, 
is more complex since there are two input channels 51, 52 and therefore two 
loudspeakers 12L, 12R. It is therefore necessary to model two echo paths K and L 
for each of the two return channels 3L, 3R. (The process is only shown for return 
channel 3L, using microphone 11L.) Correct echo cancellation is only achieved if 
adaptive filters 161L, 162L model the signal paths K and L respectively. (Two 
further filters 161R, 162R are required for the other return channel 3R.) However, it 
is not possible to find a correct model for each path K, L independently without 
some difficult and expensive signal processing as described in "A better 
understanding and an improved solution to the specific problems of stereophonic 
echo cancellation" (IEEE Transactions on Speech and Audio Processing, Vol 6, no 2, 
March 1998. Authors: J Benesty, D R Morgan and M M Sondhi). 

The system described above with reference to Figure 4 employs linear 
artificial spatialisation techniques. Figure 11 shows how this, and the fact that the 
echo from each loudspeaker 12L, 12R combines linearly at each microphone 11L 
(11R, not shown), allows echo cancellation to be provided for each output channel 
3L (3R) by having a separate adaptive filter 161L, 162L, 163L (161R, 162R, 
163R) on each input channel 51, 52, 53. Thus the adaptive filter 161L will model 
the combination of the spatialiser 151 for the channel 51, and the echo path 
between the loudspeakers 12L and 12R and the microphone 11L. 
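
The arrangement of Figure 11 might be sketched for a single microphone as follows: one adaptive filter per input channel, with the filter outputs added in a combiner to form the echo estimate. As in the monophonic case, the use of the NLMS update, the tap count and the step size are assumptions made for illustration, and the function name is hypothetical.

```python
def multichannel_echo_cancel(inputs, mic, taps=8, mu=0.5, eps=1e-8):
    """Cancel echo at one microphone using one adaptive filter per input channel.

    inputs: list of sample lists, one per decoded input channel (51, 52, ...),
            taken before spatialisation; each at least as long as `mic`
    mic:    microphone samples for one return channel
    Returns the residual signal for that return channel.
    """
    n_ch = len(inputs)
    weights = [[0.0] * taps for _ in range(n_ch)]  # one filter per input channel
    history = [[0.0] * taps for _ in range(n_ch)]
    residual = []
    for n, d in enumerate(mic):
        for c in range(n_ch):
            history[c] = [inputs[c][n]] + history[c][:-1]
        # Combiner: add the outputs of all the filters to form the echo estimate.
        estimate = sum(w * h for c in range(n_ch)
                       for w, h in zip(weights[c], history[c]))
        e = d - estimate
        power = sum(h * h for c in range(n_ch) for h in history[c]) + eps
        for c in range(n_ch):
            weights[c] = [w + mu * e * h / power
                          for w, h in zip(weights[c], history[c])]
        residual.append(e)
    return residual
```

Each filter here learns the combined effect of its channel's spatialiser and the acoustic paths from both loudspeakers. The sketch relies on the input channels, which carry different talkers, being largely uncorrelated with one another.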

The invention could be applied to a conference situation in which there are 
several participants at each location, such as the video conference shown in Figure 
7. Close microphones 11a, 11b, 11c, for example of the "tie-clip" type, are used 
to pick up the sound from each individual talker, and a talker location system 60 is 
used to keep track of their spatial position. The talker location system 60 may 
comprise a system of microphones which can identify the positions of sound 
sources. Relating the position of a sound source to that of the tie clip microphone 
11 currently in use makes it possible to learn the position of each talker by audio 
means alone. Alternatively, the system may detect the position of each user by 
means such as optical recognition of a badge carried by each user. In either case, 
the position data is passed to the far end (Room B), where correct spatialisation is 
reconstructed, for output by loudspeakers 12L, 12M, 12R etc. This would achieve 
a true spatial conference and overcome the associated echo control problems, 
since the "tie clip" microphones 31a, 31b, 31c have a limited range and will not 
detect the outputs from the loudspeakers 32 in the same room. 

CLAIMS 

1. A teleconferencing system comprising a conference bridge having a 
multichannel connection to each customer equipment, at least one customer 
equipment having means to separately process each channel to provide a plurality 
of outputs, each output representing one of the other participants. 

2. A system according to claim 1, wherein the customer equipment has 
means to combine the outputs representing each participant to provide a 
spatialised output in which each participant is represented by a virtual sound 
source. 

3. A system according to claim 1 or 2, wherein the conference bridge 
comprises a concentrator, having means to identify the currently active input 
channels and to transmit only those active channels over the multichannel 
connection, together with control information identifying the transmitted channels. 

4. A system according to any preceding claim, wherein the channel 
representing a given participant is excluded from the output provided to that 
participant. 

5. A system according to claim 4, comprising means in the customer 
equipment for excluding the said channel from the processing. 

6. A system according to claim 4, comprising means for excluding the said 
channel from the multichannel transmission from the bridge to the respective 
participant. 

7. A system according to any preceding claim, the customer equipment 
having echo cancellation means comprising means for detecting correlations 
between the output signal from the customer equipment and the signals carried on 
individual input channels to the customer equipment representative of other users, 
such correlations being indicative of acoustic feedback at the customer equipment, 
and means for cancelling such feedback signals in the output signal. 

8. A system according to claim 1, wherein the customer equipment 
comprises, for each channel of the output signal, a plurality of adaptive filters, 
each adaptive filter being arranged to model the echo path between a respective 
input channel and the respective output channel, and 
for each output channel there being provided a combiner for adding the outputs of 
the respective plurality of adaptive filters to generate an echo cancellation signal 
for the respective output channel. 

9. A method of providing teleconferencing services to a plurality of customer 
equipments, in which a multichannel connection is provided from a conference 
bridge to each customer equipment, in which at least one customer equipment 
processes each channel separately to provide a plurality of outputs, each 
representing one of the other participants. 

10. A method according to claim 9, wherein the conference bridge identifies 
the currently active input channels and transmits only those active channels over 
the multichannel connection, together with control information identifying the 
transmitted channels. 



ABSTRACT 
TELECONFERENCING SYSTEM 

A teleconferencing system comprises a conference bridge 100 having a 
multichannel connection 5 to each customer equipment 10. The customer 
equipment 10 has means 14 to separately process each channel 51, 52, 53 to 
provide one output representing each of the other participants. These outputs can 
be combined in a spatialiser 15 to provide a spatialised output 12 in which each 
participant is represented by a virtual sound source. The conference bridge 100 
comprises a concentrator 230, having means to identify the currently active input 
channels and to transmit only those active channels over the multichannel 
connection, together with control information identifying the transmitted channels. 

Figure (2) 
